Purpose

The purpose of today’s lab is not to teach you everything there is to know about coding in R. It is not even to describe why the code in R works the way it does. Instead, we take an applied approach to learning R. We hope that, by giving you a functional understanding of R and suggesting some strategies for overcoming coding obstacles, you will be able to begin playing around with the language. As with other coding languages, getting hands-on experience with R is one of the fastest ways to learn the language.

To that end, today’s lab will cover:

  1. How to download and install R and RStudio
  2. The panes of RStudio
  3. How to create and use R Markdown Documents
  4. (Some of) the different types of variables in R
  5. Functions
  6. How to install and load packages and…
  7. How to load data

After we have covered the lab content, we will move on to Minihacks. Minihacks are small coding projects intended to test your knowledge of the day’s material. The minihacks will be similar to, but slightly narrower in focus than, the homework questions. If you are able to successfully complete all the minihacks, you should be well equipped to tackle the homeworks.


Getting Started

So what is R?

In the simplest possible terms, R is a programming language used for conducting analyses and producing graphics. It is substantially more flexible than GUI-based statistics programs (e.g., SPSS, LISREL), but less flexible than other programming languages. In this case, its lack of flexibility is an asset; it allows the code to be written in a far more efficient and intuitive way than in other programming languages. R is relatively new compared to other programming languages, but it is quickly becoming one of the most used languages by data scientists.

Only one piece of software is required to get started using the R programming language and, confusingly, it is also called R. I will refer to it here as the R Engine. The R Engine essentially allows the computer to understand the R programming language, turning your lines of text into computer operations. Unlike other popular statistics programs (e.g., SPSS, SAS), the R Engine is free. Instructions for downloading the R Engine are below.

A second piece of software that is not required to use R but is nonetheless useful is RStudio. RStudio is an integrated development environment (IDE) or, in potentially overly simplistic terms, a tool that makes interacting with the R Engine easier. Instructions for downloading RStudio are below.

Downloading the R Engine

  1. Navigate to the webpage for the Comprehensive R Archive Network (commonly referred to as CRAN).
  2. Under “Download and Install R” click the appropriate link for your operating system. I am using a Mac, so I would click Download R for (Mac) OS X.
  3. Click the link for the latest release. As of writing this, the newest release is R 3.6.1, "Action of the Toes" (all version nicknames are references to the Peanuts comic strip). I would click R-3.6.1.pkg to start the download.
  4. Once the file is downloaded, click on it to open it. Your operating system should guide you through the rest of the installation process.

Note. The same steps are used to update the R Engine: You install a new version, replacing the old version in the process.

Downloading RStudio

  1. Navigate to the webpage for the free version of RStudio. For our purposes (and, in fact, for most people’s purposes) the free version is all that you need. The available installers are listed at the bottom of the page under the header “Installers for Supported Platforms.”
  2. Select the installer for your operating system. Since I am using a Mac, I would click RStudio 1.2.1335 - macOS 10.12+ (64-bit).
  3. Once the file is downloaded, click on it to open it. Your operating system should guide you through the rest of the installation process.

Note. To update RStudio after it is already installed, all you have to do is navigate to Help > Check for Updates in the menubar.


Features of RStudio

As shown in the image below, an RStudio session is split into four sections called panes: the source pane, the console, the environment/history pane, and the succinctly named files/plots/packages/help pane.

The Console

In RStudio, the console is the access point to the underlying R Engine. It evaluates the code you provide it (including code called using the source pane). You can pass commands to the R Engine by typing them after the > prompt.

Source

The source pane shows you a collection of code called a script. In R, we primarily work with R Script files (files ending in .R) or R Markdown files (files ending in .Rmd). In this class, we will mostly be working with R Markdown files. In fact, the document you are reading right now was made with R Markdown.

Environment/History

The environment and history pane shows, well, your environment and history. Specifically, if you have the “Environment” tab selected, you will see a list of all the variables that exist in your global environment. If you have the “History” tab selected, you will see previous commands that you passed to the console.

Files/Plots/Packages/Help

The final pane, the files/plots/packages/help pane, includes a number of helpful tabs. The “Files” tab shows you the files in your current working directory, the “Plots” tab shows you a preview of any plots you have created, the “Packages” tab shows you a list of the packages currently installed on your computer, and the “Help” tab is where help documentation will appear. We will discuss packages and help documentation later in this lab.


R Markdown

As noted above, you will mostly be using R Markdown documents in this course (i.e., it is required that your homeworks be created using R Markdown documents).

Creating an R Markdown Document

  1. Click on the blank piece of paper with the plus sign over it in the upper left-hand corner of RStudio.

  2. Click on R Markdown....

  3. Enter the title of the document and your name. For the purposes of this example, I have chosen to title the document lab_1 and I have chosen (for the 9,701st day in a row) to name myself Cameron.

Congratulations! You now have an R Markdown document!

Using an R Markdown Document

The content of R Markdown documents is split into two main types. I will call the first type simple text. Simple text will not be evaluated by the computer other than to be formatted according to markdown syntax. If you are answering a homework question or interpreting the results of an analysis, you will likely be using simple text. The markdown syntax allows you to format the text, such as italicizing words when they are enclosed in asterisks (e.g., *this is italicized* becomes this is italicized) or bolding words when they are enclosed in double asterisks (e.g., **this is bold** becomes this is bold). For a quick rundown of R Markdown formatting, I suggest you check out the Markdown section of the R Markdown Cheat Sheet.

In addition to simple text, R Markdown documents support blocks (also called chunks) of R code. The blocks of R code are evaluated by the computer. R code chunks are surrounded by ```{r} and ``` to tell the computer that the contents should be evaluated. In the example image below, the 1 + 2 in the R code chunk will be evaluated when the document is “knitted” (rendered). A code chunk is where you will write the analyses for your homeworks.

Knitting an R Markdown Document

In order to knit an R Markdown document, you can either press command + shift + k (control + shift + k on Windows) or click the button at the top of the R Markdown document that says Knit. The computer will take several seconds (or, depending on the length of the R Markdown document, several minutes) to knit the document. Once the computer has finished knitting the document, a new document will appear in the same location where the R Markdown document is saved. In this example, the new document will end with a .html extension.

As shown in the above image, the simple text in the R Markdown document on the left was rendered into formatted text in the knitted document on the right. The equation in the code chunk was also evaluated in the knitted document, returning the value 3.


The Basics of Coding in R

Arithmetic commands

As mentioned above, you can pass commands to the R Engine via the console. R has arithmetic commands for doing basic math operations, including addition (+), subtraction (-), multiplication (*), division (/), and exponentiation (^).

R will automatically follow the PEMDAS order of operations (BEDMAS if you are Canadian). Parentheses can be used to tell R what parts of the equation should be evaluated first. As shown below (and as expected), (10 + 5) * 2 is not equivalent to 10 + 5 * 2.
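As a small sketch of the operators and the order of operations described above, the following lines can be typed into the console one at a time. The specific numbers other than (10 + 5) * 2 and 10 + 5 * 2 are chosen only for illustration.

```r
# Basic arithmetic operators
7 + 2
## [1] 9
7 / 2
## [1] 3.5
2 ^ 3
## [1] 8

# Parentheses force the addition to be evaluated first
(10 + 5) * 2
## [1] 30

# Without parentheses, multiplication is evaluated before addition
10 + 5 * 2
## [1] 20
```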

Creating Variables

You can create variables using the assignment operator (<-). Whatever is on the right of the assignment operator is saved to the name specified on the left of the assignment operator. I like to imagine that there is a box with a name on it and you are placing a value inside of the box. For example, if we wanted to place 10 into a variable called my_number we would write my_number <- 10.

If we want to see what is stored in my_number, we can simply type my_number into the console and press enter. It is essentially asking the computer, “What’s in the box?”

If we want to overwrite my_number with a new value, we simply assign a new value to my_number.

Looking at my_number again, we can see that it is now 20.

You can treat variables (e.g., my_number) just like you would the underlying values. For example, you could add 5 to my_number.

Keep in mind, the above operation does not save the result of my_number + 5 to my_number. To do that, you would have to assign the result of my_number + 5 to my_number.

Finally, if we want to remove a variable from our environment, we can use rm().
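The steps described in this section can be sketched as one short console session. The call to exists() at the end is an addition for illustration; it simply checks whether a variable with that name can still be found.

```r
# Place the value 10 in the box labelled my_number
my_number <- 10
my_number
## [1] 10

# Overwrite the contents of the box
my_number <- 20
my_number
## [1] 20

# Use the variable like the value it holds; my_number itself is unchanged
my_number + 5
## [1] 25
my_number
## [1] 20

# Save the result back into my_number
my_number <- my_number + 5
my_number
## [1] 25

# Remove the variable from the environment
rm(my_number)
exists("my_number")
## [1] FALSE
```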

Types of Variables

The variables we will work with can be of four types (also shown in the table below): (1) logical values (also called booleans) that can be either TRUE or FALSE, (2) integer values, which are denoted by the suffix L and are whole numbers (i.e., numbers without decimal values), (3) doubles, which are numbers that can include values before and after the decimal point (most numeric values in R default to doubles), and (4) character values (also called strings), which are pieces of text enclosed in quotation marks (").

Type              Examples
Logical/Boolean   TRUE, FALSE
Integer           10L, -10L
Double            10.50, -10.50
Character         "Hello", "World"
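One way to check which of these types a value belongs to is the family of is.*() functions in base R. The sketch below is only an illustration of the table above; note in particular that a whole number typed without the L suffix is stored as a double.

```r
is.logical(TRUE)
## [1] TRUE
is.integer(10L)
## [1] TRUE
# Without the L suffix, 10 is stored as a double, not an integer
is.integer(10)
## [1] FALSE
is.double(10)
## [1] TRUE
is.character("Hello")
## [1] TRUE
```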

Vectors

Atomic Vectors

A collection of values is called a vector. If they are all of the same type, we call them atomic vectors. In R, we use the c() command to concatenate (or combine) values into an atomic vector.

Just as we did with the scalar values above, we can assign a vector to a variable.

To print out the entire vector, we can simply type my_vector into the console.

In order to select just one value from the vector, we use square brackets ([]). For example, if we wanted the third value from my_vector we would type my_vector[3] (see footnote 1).

If we want to replace a specific value in a vector, we use the assignment operator (<-) in conjunction with the square brackets ([]).

As with single-value objects, we can perform arithmetic operations on vectors, but the behaviour is not identical to that for single values. If the vectors are the same length, each value from one vector will be paired with the corresponding value from the other vector. See below for an example of this in action.

If the vectors are of different lengths, the shorter vector will be recycled (i.e., repeated) to be the same length as the longer vector.

This also works when the longer vector is not a multiple of the shorter vector, but you will get the warning: longer object length is not a multiple of shorter object length.
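The vector operations described in this section can be sketched as follows. The values stored in my_vector and the small vectors used in the arithmetic are assumptions chosen for illustration.

```r
# Combine values into an atomic vector and assign it to a variable
my_vector <- c(10, 20, 30, 40, 50, 60)
my_vector
## [1] 10 20 30 40 50 60

# Select the third value
my_vector[3]
## [1] 30

# Replace the third value
my_vector[3] <- 100
my_vector
## [1]  10  20 100  40  50  60

# Same length: values are paired up position by position
c(1, 2, 3) + c(10, 20, 30)
## [1] 11 22 33

# Different lengths: the shorter vector is recycled (10, 20, 10, 20)
c(1, 2, 3, 4) + c(10, 20)
## [1] 11 22 13 24

# Not a multiple: still recycled (10, 20, 10), but R issues a warning
c(1, 2, 3) + c(10, 20)
## [1] 11 22 13
```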


1. Unlike most other coding languages (e.g., Python), indices in R start at 1 instead of 0. For instance, if you want to select the first element of a vector, you write my_vector[1] instead of my_vector[0]. A second difference to keep in mind is that, in R, a negative index removes the value at that position from the vector rather than counting back from the end of the vector. By way of illustration, if your vector was c(10, 20, 30, 40, 50, 60), using vector[-2] in R would return c(10, 30, 40, 50, 60), whereas in Python it would return 50.
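The indexing behaviour in this footnote can be checked directly in the console. The variable name scores is hypothetical, and the last line shows one possible way to get the second-to-last value in R, which is what [-2] would return in Python.

```r
scores <- c(10, 20, 30, 40, 50, 60)

# Indices start at 1
scores[1]
## [1] 10

# A negative index drops the value at that position
scores[-2]
## [1] 10 30 40 50 60

# Counting back from the end has to be done explicitly
scores[length(scores) - 1]
## [1] 50
```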

Lists

A vector that can accommodate more than one type of value (e.g., a double AND a character) is called a list. To create a list, we use list() instead of c(). For instance, if we wanted to create a list with the values 10, "hello", and TRUE we would use list(10, "hello", TRUE) as shown below.
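A minimal sketch of the difference between list() and c() follows. The variable name my_list and the use of sapply() to apply typeof() to each element are additions for illustration.

```r
# A list keeps each value's original type
my_list <- list(10, "hello", TRUE)
sapply(my_list, typeof)
## [1] "double"    "character" "logical"

# c() builds an atomic vector, so mixed values are coerced to one type
c(10, "hello", TRUE)
## [1] "10"    "hello" "TRUE"
```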

Although lists are an incredibly powerful type of data structure, dealing with them can be quite frustrating (especially for beginning coders). Since you are unlikely to need to know the inner workings of lists for anything we will be doing in this course, I have chosen not to include much about them here. However, as you become a more advanced user, learning to leverage lists will make your code far more efficient.

Data Frames

In R you will mostly be working with data frames. A data frame is technically a list of atomic vectors that all have the same length. For our purposes, we can think of a data frame as a spreadsheet that has columns for variables and rows for observations.

Let’s look at a data frame that is automatically loaded when you open R, mtcars. Type mtcars to print out the data frame.

The data frame mtcars has a row for each of the 32 cars featured in the 1974 Motor Trend magazine. There is a column for each car’s miles per gallon (mpg), number of cylinders (cyl), engine displacement (disp), horsepower (hp), rear axle ratio (drat), weight in thousands of pounds (wt), quarter-mile time (qsec), engine shape (vs), transmission type (am), number of forward gears (gear), and number of carburetors (carb).
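Before extracting values, it can help to confirm the shape of the data frame. The sketch below uses only base R functions on mtcars; it also looks up the position of the gear column and the name of the third row, which are the two indices used in the extraction examples.

```r
# A data frame is a list under the hood
is.list(mtcars)
## [1] TRUE

# Number of rows (cars) and columns (variables)
dim(mtcars)
## [1] 32 11

# Column names
names(mtcars)
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"

# Position of the gear column and name of the third row
which(names(mtcars) == "gear")
## [1] 10
rownames(mtcars)[3]
## [1] "Datsun 710"
```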

With data frames, you can extract a value by including [row, col] immediately after the object. For example, if we wanted to extract the value for the number of gears in the Datsun 710 we could use mtcars[3, 10] to extract the value of the third row in the tenth column.

mtcars[3, 10]
## [1] 4

Since the rows and columns have names, we can also be explicit and use the name of the row and the name of the column instead of the row and column indices.

mtcars["Datsun 710", "gear"]
## [1] 4

If you want to extract an entire column or an entire row, you can drop the index value for the row or column, respectively. Since you don’t specify a given row or column that you want, the computer assumes you want all values. For example, to extract all values stored in the gear column, we drop the row index (e.g., [, 10] or [, "gear"]).

mtcars[, 10]
##  [1] 4 4 4 3 3 3 3 4 4 4 4 3 3 3 3 3 3 4 4 4 3 3 3 3 3 4 5 5 5 5 5 4
mtcars[, "gear"]
##  [1] 4 4 4 3 3 3 3 4 4 4 4 3 3 3 3 3 3 4 4 4 3 3 3 3 3 4 5 5 5 5 5 4

To extract all values associated with the Datsun 710, we drop the column index (e.g., [3, ] or ["Datsun 710", ]).

mtcars[3, ]
##             mpg cyl disp hp drat   wt  qsec vs am gear carb
## Datsun 710 22.8   4  108 93 3.85 2.32 18.61  1  1    4    1
mtcars["Datsun 710", ]
##             mpg cyl disp hp drat   wt  qsec vs am gear carb
## Datsun 710 22.8   4  108 93 3.85 2.32 18.61  1  1    4    1

You can also extract columns using $ followed by the column name without quotes.

mtcars$gear
##  [1] 4 4 4 3 3 3 3 4 4 4 4 3 3 3 3 3 3 4 4 4 3 3 3 3 3 4 5 5 5 5 5 4

If we want to extract multiple columns (or multiple rows) we can also use vectors. For example, if we wanted the number of gears and carburetors for the Datsun 710 and the Duster 360, we can use [c("Datsun 710", "Duster 360"), c("gear", "carb")].

mtcars[c("Datsun 710", "Duster 360"), c("gear", "carb")]
##            gear carb
## Datsun 710    4    1
## Duster 360    3    4

Functions

Functions are essentially pre-packaged snippets of code that make your life easier. A function takes one or more pieces of input (called arguments) and returns one or more pieces of output (called values). For example, length() is a function that takes a vector as its sole argument and returns the length of the vector as its value.

length(c(10, 20, 30, 40, 50, 60))
## [1] 6

The function unique() also takes a vector as its primary argument, but—instead of returning the length of the vector as its value—it returns the vector’s unique values.

unique(c("condition_a", "condition_a", "condition_b", "condition_a", "condition_b"))
## [1] "condition_a" "condition_b"

The mean() function and sd() function are two functions that you will end up using a lot. The former takes a numeric vector as one of its arguments and returns the average of that vector as its output. The latter also takes a numeric vector as one of its arguments, but it returns the standard deviation of that vector as its output.

mean(c(10, 20, 30, 40, 50, 60))
## [1] 35
sd(c(10, 20, 30, 40, 50, 60))
## [1] 18.70829

Although it is more conceptual, it would also be useful to mention the typeof() function here. The function typeof() takes any object and tells you what type of variable it is.

typeof(10L)
## [1] "integer"
typeof(10)
## [1] "double"
typeof("hello")
## [1] "character"
typeof(TRUE)
## [1] "logical"

Using the suite of as.*() functions (e.g., as.numeric(), as.character(), as.logical(), as.integer()), we can also coerce objects to other types.

as.numeric("10")
## [1] 10
as.character(10)
## [1] "10"
as.character(TRUE)
## [1] "TRUE"
as.numeric(TRUE)
## [1] 1
as.integer(10.30)
## [1] 10

Help Documentation

Sometimes when working in R you will want to know more about a function. For example, you might want to know what arguments the function sd() takes. You can type ? before the name of any function to display its help documentation.

?sd

From the help documentation we can see that sd takes two arguments: (1) An R object and (2) a TRUE or FALSE statement indicating whether NAs (unknown values) should be removed before the standard deviation is calculated.

Normally R will infer, based on the order, what values are being passed as what argument. For example, since sd() expects that the argument x will be provided first and the argument na.rm will be provided second, the following works:

sd(c(10, 20, 30, 40, 50, 60), FALSE)
## [1] 18.70829

However, we can make it explicit and tell R what values are associated with which arguments.

sd(x = c(10, 20, 30, 40, 50, 60), na.rm = FALSE)
## [1] 18.70829

The help documentation for a function also often includes an example of how to use the function and details on what the output will be.
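To see what the na.rm argument described above actually does, consider a vector that contains a missing value. The vector scores_with_na is hypothetical and exists only for this sketch.

```r
# The argument names reported for sd()
names(formals(sd))
## [1] "x"     "na.rm"

scores_with_na <- c(10, 20, NA, 40)

# With the default na.rm = FALSE, a single NA makes the result NA
sd(scores_with_na)
## [1] NA

# With na.rm = TRUE, the NA is dropped before the calculation
sd(scores_with_na, na.rm = TRUE)
## [1] 15.27525
mean(scores_with_na, na.rm = TRUE)
## [1] 23.33333
```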


Packages

A package can include code, documentation for that code, and/or data. A helpful way to think of packages is as a toolbox full of data analysis tools.

There are general purpose toolboxes that contain tools for running common analyses in psychology (e.g., psych), toolboxes for helping you run advanced statistical models (e.g., lavaan, lme4), toolboxes for text mining (e.g., tidytext), and toolboxes for plotting (e.g., ggplot2, gganimate). If you have a problem that needs to be solved, there will probably be a package for it.

Installing packages

To install a package onto your computer, you simply pass the name of the package to install.packages(). As a demonstration, we install the psych package below. The psych package has a panoply of useful data analysis tools for psychologists. You generally only need to install a package once.

install.packages("psych")

Note. When installing packages, the package name must be enclosed in quotes: install.packages("psych") NOT install.packages(psych).

Loading a package

Just because we’ve installed a package to our computer doesn’t mean we have access to its functions. Buying a toolbox doesn’t necessarily give you access to its tools. You also have to open the toolbox. To open the toolbox and load its tools, we use library(). Below we load the psych package.

library(psych)

We have to load a package every time R is restarted.

Note. A package can be loaded with or without quotes: library("psych") OR library(psych).
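The difference between a package being installed and a package being loaded can be checked directly. The sketch below uses tools, a package that ships with R, purely as a stand-in so that nothing needs to be downloaded; the same checks should apply to psych once it has been installed. The variable name is_installed is hypothetical.

```r
# Is the package installed on this computer? (tools ships with R)
is_installed <- requireNamespace("tools", quietly = TRUE)
is_installed
## [1] TRUE

# Load the package, i.e., open the toolbox
library(tools)

# Loaded packages appear in the search path
"package:tools" %in% search()
## [1] TRUE

# A function from the loaded package can now be called by name
toTitleCase("hello world")
## [1] "Hello World"
```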

Try psych commands

Now that we have installed and loaded the psych package, let’s try out some of its commands.

Using corr.test() we can make a correlation matrix of the variables in mtcars.

corr.test(mtcars)
## Call:corr.test(x = mtcars)
## Correlation matrix 
##        mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
## mpg   1.00 -0.85 -0.85 -0.78  0.68 -0.87  0.42  0.66  0.60  0.48 -0.55
## cyl  -0.85  1.00  0.90  0.83 -0.70  0.78 -0.59 -0.81 -0.52 -0.49  0.53
## disp -0.85  0.90  1.00  0.79 -0.71  0.89 -0.43 -0.71 -0.59 -0.56  0.39
## hp   -0.78  0.83  0.79  1.00 -0.45  0.66 -0.71 -0.72 -0.24 -0.13  0.75
## drat  0.68 -0.70 -0.71 -0.45  1.00 -0.71  0.09  0.44  0.71  0.70 -0.09
## wt   -0.87  0.78  0.89  0.66 -0.71  1.00 -0.17 -0.55 -0.69 -0.58  0.43
## qsec  0.42 -0.59 -0.43 -0.71  0.09 -0.17  1.00  0.74 -0.23 -0.21 -0.66
## vs    0.66 -0.81 -0.71 -0.72  0.44 -0.55  0.74  1.00  0.17  0.21 -0.57
## am    0.60 -0.52 -0.59 -0.24  0.71 -0.69 -0.23  0.17  1.00  0.79  0.06
## gear  0.48 -0.49 -0.56 -0.13  0.70 -0.58 -0.21  0.21  0.79  1.00  0.27
## carb -0.55  0.53  0.39  0.75 -0.09  0.43 -0.66 -0.57  0.06  0.27  1.00
## Sample Size 
## [1] 32
## Probability values (Entries above the diagonal are adjusted for multiple tests.) 
##       mpg cyl disp   hp drat   wt qsec   vs   am gear carb
## mpg  0.00   0 0.00 0.00 0.00 0.00 0.22 0.00 0.01 0.10 0.02
## cyl  0.00   0 0.00 0.00 0.00 0.00 0.01 0.00 0.04 0.08 0.04
## disp 0.00   0 0.00 0.00 0.00 0.00 0.20 0.00 0.01 0.02 0.30
## hp   0.00   0 0.00 0.00 0.17 0.00 0.00 0.00 1.00 1.00 0.00
## drat 0.00   0 0.00 0.01 0.00 0.00 1.00 0.19 0.00 0.00 1.00
## wt   0.00   0 0.00 0.00 0.00 0.00 1.00 0.02 0.00 0.01 0.20
## qsec 0.02   0 0.01 0.00 0.62 0.34 0.00 0.00 1.00 1.00 0.00
## vs   0.00   0 0.00 0.00 0.01 0.00 0.00 0.00 1.00 1.00 0.02
## am   0.00   0 0.00 0.18 0.00 0.00 0.21 0.36 0.00 0.00 1.00
## gear 0.01   0 0.00 0.49 0.00 0.00 0.24 0.26 0.00 0.00 1.00
## carb 0.00   0 0.03 0.00 0.62 0.01 0.00 0.00 0.75 0.13 0.00
## 
##  To see confidence intervals of the correlations, print with the short=FALSE option

Using skew, we can look at the skew of all of the columns in mtcars.

skew(mtcars)
##  [1]  0.6106550 -0.1746119  0.3816570  0.7260237  0.2659039  0.4231465
##  [7]  0.3690453  0.2402577  0.3640159  0.5288545  1.0508738

We can use t2d to calculate the Cohen’s d for a t-value of 3.00 with 300 participants.

t2d(t = 3.00, n = 300)
## [1] 0.3464102

This is only a small subset of the functions available in the psych package, and psych is only one package of over 11,000 on CRAN (as of 2018). This is not to mention the tens of thousands of packages hosted on online repositories like GitHub (https://github.com). As Cory Costello noted during R Bootcamp, the question in R is never if but how.

Importing Data into R

The final topic that we will cover in this lab is how to load data into R. Over the course of your grad school careers (and many times in this class) you will need to import data into R to be analyzed.

For this example, we will be using the planets dataset from Star Wars. The data can be downloaded here.

When I took this course, you would have to use file-type-specific functions to load data (e.g., read.csv, read_excel). The rio package streamlines this process by having a single import function (import()) that interprets the file type to be imported from its extension (e.g., .csv, .xlsx, .sav). As we did for psych, we will first need to install rio.

install.packages("rio")

Second, we will need to load rio.

library(rio)

Once this is done, importing the data is as easy as passing the location of the downloaded file to import() and saving the data into a variable (called planets_data here). In this case, sw_planets.xlsx was saved to my Downloads folder. If it were in a folder called data_sets on my desktop, I would have used "~/Desktop/data_sets/sw_planets.xlsx" as the argument.

planets_data <- import("~/Downloads/sw_planets.xlsx")

The tilde (~) in the above string is a shortcut for the home directory on my computer. On my computer, it represents /Users/cameronkay. Replacing the tilde with the path to the home directory should have the exact same result as using the tilde.

planets_data <- import("/Users/cameronkay/Downloads/sw_planets.xlsx")
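If import() cannot find the file, it can help to check the path itself. The sketch below uses only base R functions; the variable name planets_path is hypothetical, and the result of file.exists() depends on whether the file was actually saved to that location on your computer.

```r
# See what the tilde stands for on your computer
path.expand("~")

# Build the path piece by piece instead of typing the slashes by hand
planets_path <- file.path("~", "Downloads", "sw_planets.xlsx")
planets_path
## [1] "~/Downloads/sw_planets.xlsx"

# TRUE only if a file exists at that location
file.exists(planets_path)
```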

To ensure it was read in properly, we can look at the first six rows of the imported dataset by using the head() function and the last six rows by using the tail() function.

head(planets_data)
tail(planets_data)
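Both head() and tail() accept an n argument if you want something other than six rows. Because the contents of the downloaded file are not reproduced here, the sketch below uses the built-in mtcars data frame as a stand-in for planets_data.

```r
# First three rows instead of the default six
head(mtcars, n = 3)

# Confirm the size of what was returned
dim(head(mtcars, n = 3))
## [1]  3 11
nrow(tail(mtcars))
## [1] 6
```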

Minihacks

Minihack 1: Create an R Markdown Document

  1. Create an R Markdown document called lab1_minihacks.

Minihack 2: Arithmetic Commands

  1. Use R to solve for \(x\): \[x = \frac{(102 + 68) \times (3 + 2) + 1250}{50}\]

  2. Assign the value of \(x\) to a variable called x.

  3. Assign the numbers 10, 20, and 30 to a vector called y.

  4. Add x to the length of y (hint. Use length()), and assign the result to a variable called z.

Minihack 3: Functions

  1. Assign the string "I AM NOT YELLING" to a variable called exclamation.

  2. Use the function tolower() to uncapitalize exclamation. Assign the result to exclamation.

  3. Use the capitalize() function from the Hmisc package to capitalize the first letter of exclamation (hint. You will need to install and load Hmisc).

Minihack 4: Accessing Help Documentation

  1. I wanted to create a vector of 5 values between 10 and 50 using seq(), but the code I wrote is creating a vector of 9 values between 10 and 50. I believe it has something to do with the arguments I used, but I can’t remember how to access the help documentation to check. Without changing the values, can you fix my code?
seq(10, 50, 5)
## [1] 10 15 20 25 30 35 40 45 50

Minihack 5: Data Frames

  1. Download the Star Wars people dataset.

  2. Import the data into R and assign it to a variable called people_data.

  3. Using $ and mean(), calculate the mean age of the characters from Star Wars. (hint. You may need to supply an additional argument to exclude NAs).

  4. Extract Lando Calrissian’s homeworld from the dataset using square brackets ([]) and assign it to a variable called landos_homeworld.